Fix availability reporting for DC with MinReadySeconds set #14936
Conversation
@tnozicka this seems like a pretty big change for 3.6
Let's get the fix that propagates MinReadySeconds into RCs, and the rest can be deferred to 3.7 (I am in favor of deferring)
Force-pushed from 00f6fb5 to de420ff
// FIXME: remove when tests are migrated to the new client
// (the old one incorrectly translates nil into an empty array)
dc.Spec.Triggers = append(dc.Spec.Triggers, deployapi.DeploymentTriggerPolicy{Type: deployapi.DeploymentTriggerOnConfigChange})
dc, err = oc.Client().DeploymentConfigs(namespace).Create(dc)
missing error check
bad rebase, fixing it. thx. And I guess I know where that check went :)
fixed
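For reference, a minimal sketch of what the fix likely amounts to (names taken from the diff above; the Gomega assertion follows the same pattern used elsewhere in this test):

```go
dc, err = oc.Client().DeploymentConfigs(namespace).Create(dc)
o.Expect(err).NotTo(o.HaveOccurred())
```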
dc, err = oc.Client().DeploymentConfigs(namespace).Create(dc)

g.By("verifying the deployment is created")
rcEvent, err := watch.Until(deploymentChangeTimeout, watcher, func(event watch.Event) (bool, error) {
watch.Until is going to flake - use wait.Poll for now
@Kargakis why is it going to flake? If you mean that the wait can fail in such a short interval, I guess a retry would be an option. I can't do polling, because I would lose eventual consistency and have a bad resourceVersion, and that would cause a flake for sure.
There is an issue with watch.Until that is yet to be fixed and causes things like rollout status to break unexpectedly: kubernetes/kubernetes#40224 cc: @mfojtik
Not sure what you mean re resourceVersion; the log above says "verifying the deployment is created", so how can polling not work in that case?
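For illustration, a minimal sketch of the poll-based existence check being suggested here. The client call is abstracted behind a closure because the exact test-helper accessors are not shown in this thread (an assumption on my part):

```go
import (
	"time"

	kerrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
)

// pollForRC waits until a GET for the RC succeeds. It exits as soon as the
// API server returns the object, so no resourceVersion bookkeeping is needed.
// get is a caller-supplied closure wrapping the ReplicationControllers Get call.
func pollForRC(get func() error, timeout time.Duration) error {
	return wait.Poll(200*time.Millisecond, timeout, func() (bool, error) {
		err := get()
		switch {
		case err == nil:
			return true, nil // the RC exists
		case kerrors.IsNotFound(err):
			return false, nil // not created yet, keep polling
		default:
			return false, err // unexpected error, give up
		}
	})
}
```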
return false, nil
}
g.By("verifying that the deployment is still running")
o.Expect(err).NotTo(o.HaveOccurred())
redundant error check
that's where it went :)
Force-pushed from de420ff to 7047cf9
@@ -1,6 +1,7 @@
package deployments

import (
	//"errors"
Drop it, not needed.
There are also commented-out parts of the code which use it; that's because this PR was split in half and the next one uncomments them. It also felt right to leave it, because I had to rename the kubernetes errors package to kerrors on account of it.
ready++
g.By("waiting for the deployment to finish")
rc1, err = waitForRCModification(oc, namespace, rc1.Name,
	deploymentChangeTimeout+time.Duration(dc.Spec.MinReadySeconds)*10*time.Second,
For readability reasons, put it on its own line.
@Kargakis I can't test this behaviour without a WATCH. It already revealed a major bug with a race between the RC controller and the deployer pod, and that can't be tested without it. Polling is a poor fit for testing this and lets errors go undiscovered. @Kargakis Can we really hit that bug in these tests? I thought it was closed because of a long timeout or too many events.
I am not sure this holds true. The bug that you uncovered regarding the RC controller and the deployer pod is unrelated to the poll vs watch argument. Here, you want to observe that the RC exists. Having a Poll that exits as soon as you get a 200 back has nothing to do with resourceVersion checks?
We have hit problems with watch.Until already upstream:
@Kargakis but the watches below do; you would miss the broken state if you polled instead of watching. To be concrete: here you would miss the state where availableReplicas==0 and deployment-phase==Complete, because with polling you would only ever see the reconciled availableReplicas==2 and deployment-phase==Complete.
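To make that argument concrete, here is a rough sketch of a watch condition that would catch the transient state; the `openshift.io/deployment.phase` annotation key is my assumption standing in for the deployapi constants, and the import paths are the current apimachinery ones:

```go
import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/watch"
)

// failOnBrokenAvailability fails the moment it observes the transient broken
// state: the deployment already marked Complete while availableReplicas is
// still 0. A poll would typically only sample the final reconciled state
// (availableReplicas == 2, phase Complete) and never see this.
func failOnBrokenAvailability(event watch.Event) (bool, error) {
	rc, ok := event.Object.(*corev1.ReplicationController)
	if !ok {
		return false, fmt.Errorf("unexpected object type %T", event.Object)
	}
	// Assumed annotation key; the real tests would use the deployapi constant.
	if rc.Annotations["openshift.io/deployment.phase"] != "Complete" {
		return false, nil // rollout still in progress, keep watching
	}
	if rc.Status.AvailableReplicas == 0 {
		return false, fmt.Errorf("rc %s/%s is Complete but has 0 available replicas", rc.Namespace, rc.Name)
	}
	return true, nil
}
```

It would plug into the existing watch as something like `watch.Until(deploymentChangeTimeout, watcher, failOnBrokenAvailability)`.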
Force-pushed from 7047cf9 to bc2690d
(just removed a debug *10 seconds from the watch timeout)
I believe we should not sacrifice correctness for a potential flake. I have tested it in a loop and haven't seen it. From what I have seen upstream, it happens with longer timeouts; these tests have a longest timeout of about 1m30s and a slow rate of events. @Kargakis here is the deal: if this flakes, I will go and fix watch.Until myself! It should also be simple to hot-fix if needed: just add a wrapper that validates the timeout, inspects events to track the last resourceVersion, and restarts the watch from there with the remaining timeout. I think @soltysh was suggesting a similar fix. I don't want to fix watch as part of this PR so it can make code freeze, and due to the low timeout it shouldn't be affected.
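A rough sketch of that hot-fix wrapper, purely illustrative; `newWatcher` is a hypothetical caller-supplied helper that reopens the watch from a given resourceVersion:

```go
import (
	"time"

	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/apimachinery/pkg/watch"
)

// untilWithRestart re-runs watch.Until with the remaining timeout whenever the
// underlying watch is closed early, resuming from the last resourceVersion
// observed so no events are replayed or lost.
func untilWithRestart(timeout time.Duration, resourceVersion string,
	newWatcher func(rv string) (watch.Interface, error),
	condition watch.ConditionFunc) (*watch.Event, error) {

	deadline := time.Now().Add(timeout)
	for {
		w, err := newWatcher(resourceVersion)
		if err != nil {
			return nil, err
		}
		lastSeen := resourceVersion
		// Track the resourceVersion of every event so a restart can resume
		// exactly where the previous watch left off.
		tracking := func(event watch.Event) (bool, error) {
			if accessor, aerr := meta.Accessor(event.Object); aerr == nil {
				lastSeen = accessor.GetResourceVersion()
			}
			return condition(event)
		}
		ev, err := watch.Until(time.Until(deadline), w, tracking)
		if err == nil || err == wait.ErrWaitTimeout || !time.Now().Before(deadline) {
			return ev, err
		}
		resourceVersion = lastSeen // watch closed early: restart from here
	}
}
```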
kubernetes/kubernetes#47697 (comment) talks about 2s, but I won't bikeshed on this anymore... You own the test now :)
new-app flake #14043 (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/2819/)
OK, I was kind of expecting that. Because @bparees removed the kind/test.flake label in #14043 (comment), I can't re-trigger the test now :( I will rebase and push (no changes)
Force-pushed from bc2690d to 953dac8
LGTM
@stevekuznetsov seems like the node failed to restart but we dunno why: Unable to restart service origin-node: Job for origin-node.service failed because the control process exited with error code. See "systemctl status origin-node.service" and "journalctl -xe" for details. re-[test]
Evaluated for origin test up to 953dac8
continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/2923/) (Base Commit: fc34104) (PR Branch Commit: 953dac8)
[merge][severity:blocker]
@mfojtik as per the e-mail out to aos-devel -- we capture the node logs. Look under
this was GCE, so no fun :/
GCE Flake #15029
/cc @zhouying7780
[merge]
Evaluated for origin merge up to 953dac8
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/1223/) (Base Commit: ecd7763) (PR Branch Commit: 953dac8) (Extended Tests: blocker) (Image: devenv-rhel7_6420)
Automatic merge from submit-queue

Fix minReadySeconds for DC

Follow-up to: #14936 (needs to be merged first)

Makes AvailableReplicas work with MinReadySeconds set. Removes the obsolete counting of pods, which overlapped with AvailableReplicas from the RC. That counting was causing the RC to be in a state where AvailableReplicas=0 and deployment-phase=Complete with about 50% chance; the state lasts for a very short time.

[Outdated] At this time, ignore the first 2 commits, which are part of #14936 (because that isn't merged yet)
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1463594
Makes AvailableReplicas work with MinReadySeconds set. Also removes the obsolete counting of pods, which overlapped with AvailableReplicas.
@Kargakis PTAL. I want to be sure there was no gotcha behind not using the RC's MinReadySeconds, other than it possibly not being available at the time this code was written.
cc: @mfojtik
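For context, a simplified sketch of what "available" means once MinReadySeconds is honoured: a pod only counts toward availableReplicas after it has been Ready for at least minReadySeconds. This is an illustration of the rule, not the actual controller code; the helper name is mine.

```go
import (
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// podIsAvailable reports whether a pod counts toward availableReplicas:
// it must be Ready, and must have stayed Ready for at least minReadySeconds.
func podIsAvailable(pod *corev1.Pod, minReadySeconds int32, now metav1.Time) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type != corev1.PodReady || c.Status != corev1.ConditionTrue {
			continue
		}
		if minReadySeconds == 0 {
			return true // no grace period required, Ready is enough
		}
		minReady := time.Duration(minReadySeconds) * time.Second
		// Available only once the Ready condition has held for minReadySeconds.
		return !c.LastTransitionTime.IsZero() &&
			c.LastTransitionTime.Add(minReady).Before(now.Time)
	}
	return false
}
```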